We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
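The zero-shot transfer step described in this abstract can be illustrated with a minimal sketch: embed one text prompt per class, embed the image, and pick the class whose prompt is most similar. The function names, the toy three-dimensional embeddings, and the use of cosine similarity as the score are assumptions made for illustration; this is not the released CLIP code or its API.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(
    image_embedding: Sequence[float],
    class_text_embeddings: Dict[str, Sequence[float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the best-matching class name and the per-class similarity scores.

    Hypothetical helper: both embeddings are assumed to come from already
    trained image and text encoders that share one embedding space.
    """
    scores = {
        name: cosine(image_embedding, text_embedding)
        for name, text_embedding in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for encoder outputs (illustrative values only).
prompts = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}
label, scores = zero_shot_classify([0.9, 0.1, 0.0], prompts)
print(label)
```

Because the classes are specified only through text prompts, adding a new class in this sketch means adding one more prompt embedding, with no labeled training images.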
We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs) that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks, demonstrating their applicability as general image representations.
Many dynamical systems exhibit latent states with intrinsic orderings such as "ally", "neutral" and "enemy" relationships in international relations. Such latent states are evidenced through entities' cooperative versus conflictual interactions which are similarly ordered. Models of such systems often involve state-to-action emission and state-to-state transition matrices. It is common practice to assume that the rows of these stochastic matrices are independently sampled from a Dirichlet distribution. However, this assumption discards ordinal information and treats states and actions falsely as order-invariant categoricals, which hinders interpretation and evaluation. To address this problem, we propose the Ordered Matrix Dirichlet (OMD): rows are sampled conditionally dependent such that probability mass is shifted to the right of the matrix as we move down rows. This results in a well-ordered mapping between latent states and observed action types. We evaluate the OMD in two settings: a Hidden Markov Model and a novel Bayesian Dynamic Poisson Tucker Model tailored to political event data. Models built on the OMD recover interpretable latent states and show superior forecasting performance in few-shot settings. We detail the wide applicability of the OMD to other domains where models with Dirichlet-sampled matrices are popular (e.g. topic modeling) and publish user-friendly code.
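Gaussian processes (GPs) are Bayesian non-parametric models that are popular in a variety of applications because of their accuracy and native uncertainty quantification (UQ). Tuning GP hyperparameters is critical to ensuring the validity of both the predictions and their uncertainties. Uniquely estimating multiple hyperparameters, for example in the Matérn kernel, can also be a significant challenge. Moreover, training GPs on large-scale datasets is a highly active area of research: traditional maximum likelihood hyperparameter training requires quadratic memory to form the covariance matrix and has cubic training complexity. To address the scalable hyperparameter tuning problem, we present a novel algorithm that estimates the smoothness and length-scale parameters of the Matérn kernel in order to improve the robustness of the resulting prediction uncertainties. Using novel loss functions similar to those of conformal prediction algorithms, within the computational framework provided by the hyperparameter estimation algorithm MuyGPs, we demonstrate in numerical experiments that the approach maintains a high degree of scalability.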
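To make the two Matérn hyperparameters named in this abstract concrete, the following minimal sketch evaluates the unit-variance Matérn kernel for the half-integer smoothness values that have well-known closed forms, and builds a dense covariance matrix to show where the quadratic memory cost comes from. The function names are hypothetical, inputs are assumed to be one-dimensional, and this is neither the MuyGPs implementation nor the estimation algorithm the abstract proposes.

```python
import math
from typing import List, Sequence


def matern(r: float, length_scale: float, nu: float) -> float:
    """Unit-variance Matern kernel at distance r, for nu in {0.5, 1.5, 2.5}."""
    if r < 0 or length_scale <= 0:
        raise ValueError("need r >= 0 and length_scale > 0")
    s = r / length_scale
    if nu == 0.5:  # exponential kernel: rough sample paths
        return math.exp(-s)
    if nu == 1.5:  # once mean-square differentiable
        a = math.sqrt(3.0) * s
        return (1.0 + a) * math.exp(-a)
    if nu == 2.5:  # twice mean-square differentiable
        a = math.sqrt(5.0) * s
        return (1.0 + a + a * a / 3.0) * math.exp(-a)
    raise ValueError("this sketch supports only nu in {0.5, 1.5, 2.5}")


def covariance_matrix(xs: Sequence[float], length_scale: float, nu: float) -> List[List[float]]:
    """Dense n-by-n covariance matrix over 1-D inputs: n * n stored entries."""
    return [[matern(abs(x - y), length_scale, nu) for y in xs] for x in xs]


K = covariance_matrix([0.0, 1.0, 2.0], length_scale=1.0, nu=1.5)
print(len(K), len(K[0]))
```

A larger `nu` gives smoother functions and a larger `length_scale` stretches the distance over which points stay correlated. Different pairs of values can produce similar covariance matrices on a given dataset, which is one way to see why estimating both uniquely can be hard.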
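This paper describes our submission to the SMM4H 2022 shared task on the classification of self-reported intimate partner violence on Twitter (in English). The goal of this task was to accurately determine whether the contents of a given tweet show someone reporting their own experience of intimate partner violence. The submitted system is an ensemble of five RoBERTa models, each weighted by its F1-score on the validation dataset. This system performed 13% better than the baseline and was the best-performing system overall for this shared task.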
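A minimal sketch of an F1-weighted ensemble of the kind this abstract describes is given below. The abstract does not say whether the models' probabilities or their hard votes are combined, so averaging positive-class probabilities, the 0.5 decision threshold, and the function name are all assumptions made for illustration.

```python
from typing import List, Sequence


def f1_weighted_ensemble(
    model_probs: Sequence[Sequence[float]],
    f1_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[int]:
    """Combine per-model positive-class probabilities into binary labels.

    model_probs[m][i] is model m's probability that example i is positive.
    f1_scores[m] is model m's validation F1, used as its (normalized) weight.
    """
    if len(model_probs) != len(f1_scores) or not model_probs:
        raise ValueError("need one F1 score per model and at least one model")
    total = sum(f1_scores)
    if total <= 0:
        raise ValueError("F1 scores must sum to a positive value")
    weights = [f1 / total for f1 in f1_scores]

    labels = []
    for i in range(len(model_probs[0])):
        # Weighted average of the models' probabilities for example i.
        score = sum(w * probs[i] for w, probs in zip(weights, model_probs))
        labels.append(1 if score >= threshold else 0)
    return labels


# Five hypothetical models scoring two tweets (values are illustrative only).
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.2, 0.9]]
f1s = [0.80, 0.80, 0.80, 0.80, 0.80]
print(f1_weighted_ensemble(probs, f1s))
```

With equal F1 scores the ensemble reduces to a plain average; unequal scores shift the decision toward the models that validated best.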
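Temporal variations of apparent magnitude, called light curves, are observational statistics of interest captured by telescopes over long periods of time. Light curves allow Space Domain Awareness (SDA) objectives such as object identification or pose estimation to be explored as latent variable inference problems. Ground-based observations from commercial off-the-shelf (COTS) cameras remain inexpensive compared with higher-precision instruments. However, limited sensor availability combined with noisier observations can produce gappy time-series data that are difficult to model. These external factors confound the automated exploitation of light curves, which makes light curve prediction and extrapolation a crucial problem for applications. Traditionally, image or time-series completion problems have been approached with diffusion-based or exemplar-based methods. More recently, deep neural networks (DNNs) have become the tool of choice because of their empirical success at learning complex nonlinear embeddings. However, DNNs often require large amounts of training data, which are not necessarily available when looking at the unique features of the light curve of a single satellite. In this paper, we present a novel approach to predicting missing and future data points of light curves using Gaussian processes (GPs). GPs are nonlinear probabilistic models that infer posterior distributions over functions and naturally quantify uncertainty. However, the cubic scaling of GP inference and training is a major barrier to their adoption in applications. In particular, a single light curve can contain hundreds of thousands of observations, which is well beyond the practical limits of a conventional GP on a single machine. Consequently, we employ MuyGPs, a scalable framework for hyperparameter estimation of GP models that uses nearest-neighbor sparsification and local cross-validation. MuyGPs ...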
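We consider online reinforcement learning in mean-field games. In contrast to existing works, we alleviate the need for a mean-field oracle by developing an algorithm that estimates the mean field and the optimal policy using a single sample path of the generic agent. We call this sandbox learning, as it can be used as a warm start for any agent operating in a multi-agent non-cooperative setting. We adopt a two-timescale approach in which an online fixed-point recursion for the mean field operates on a slower timescale, in tandem with a control policy update for the generic agent on a faster timescale. Under a sufficient exploration condition, we provide finite-sample convergence guarantees in terms of convergence of the mean field and control policy to the mean-field equilibrium. The sample complexity of the sandbox learning algorithm is $\mathcal{O}(\epsilon^{-4})$. Finally, we empirically demonstrate the effectiveness of the sandbox learning algorithm in a traffic congestion game.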
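Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation and storage demands create an urgent need for new hardware accelerator design methodologies. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework that explores model quantization. Compared with state-of-the-art ViT quantization work (an algorithmic approach only, without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy at the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around a 5.6x improvement in frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with a 0.71% accuracy drop on the ImageNet dataset for DeiT-base.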